Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrap custom iterator result #17251

Open
wants to merge 2 commits into
base: branch-24.12
Choose a base branch
from

Conversation

galipremsagar
Copy link
Contributor

@galipremsagar galipremsagar commented Nov 6, 2024

Description

Fixes: #17165

This PR properly wraps the result of custom iterator.

In [2]: import pandas as pd

In [3]: s = pd.Series([10, 1, 2, 3, 4, 5]*1000000)


# Without custom_iter:

In [4]: %timeit for i in s: True
6.34 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# This PR:

In [4]: %timeit for i in s: True
6.16 s ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# On `branch-24.12`:
1.53 s ± 6.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I think custom_iter has to exist. Here is why, invoking any sort of iteration on GPU objects will raise errors and thus in the end we fall-back to CPU. Instead of trying to move the objects from host to device memory (if the object is on host memory only), we will avoid a CPU-to-GPU transfer.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Nov 6, 2024
@galipremsagar galipremsagar added bug Something isn't working non-breaking Non-breaking change labels Nov 6, 2024
@galipremsagar galipremsagar marked this pull request as ready for review November 6, 2024 04:56
@galipremsagar galipremsagar requested a review from a team as a code owner November 6, 2024 04:56
@galipremsagar galipremsagar changed the title properly wrap custom iterator result Wrap custom iterator result Nov 6, 2024
@vyasr
Copy link
Contributor

vyasr commented Nov 6, 2024

@galipremsagar could you add a test that shows the undesirable fallback behavior of iter? You should be able to do this by monkey-patching os.environ with CUDF_PANDAS_FAIL_ON_FALLBACK and showing that fallback is triggered undesirably. That ought to be possible without a subprocess, but you might have to define a new custom proxy type without custom_iter to show it.

If that turns out to be too much work we can just document it, but I want to make it as obvious as possible to future devs why this is necessary.

@galipremsagar
Copy link
Contributor Author

@galipremsagar could you add a test that shows the undesirable fallback behavior of iter? You should be able to do this by monkey-patching os.environ with CUDF_PANDAS_FAIL_ON_FALLBACK and showing that fallback is triggered undesirably. That ought to be possible without a subprocess, but you might have to define a new custom proxy type without custom_iter to show it.

If that turns out to be too much work we can just document it, but I want to make it as obvious as possible to future devs why this is necessary.

I agree, I'll add a test and document the reason behind the custom iterator existence.

@vyasr
Copy link
Contributor

vyasr commented Nov 7, 2024

This PR also closes #14481 right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cudf.pandas Issues specific to cudf.pandas non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: In Progress
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[BUG] Incorrect dtype when iterating over dtypes in cudf.pandas
2 participants